This document summarises the data from https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.001081. This study was produced by Michael Hall from Zam’s group and describes the development of the DRPRG tool that allows for drug resistance prediction using genome graphs.
In doing so, Michael collated and carefully(!) curated an enriched WHO dataset, which is summarised here.
(this excludes 527 instances of duplicated biosample IDs with different runIDs).
There is at least one drug phenotype for each sample for 22 drugs: amikacin, bedaquiline, capreomycin, ciprofloxacin, clofazimine, cycloserine, delamanid, ethambutol, ethionamide, gatifloxacin, isoniazid, kanamycin, levofloxacin, linezolid, moxifloxacin, ofloxacin, para-aminosalicylic_acid, pyrazinamide, rifabutin, rifampicin, streptomycin, thioacetazone.
A breakdown of R/S phenotypes for the samples can be seen in the plot below:
pheno_data %>%
select(-run, -bioproject, -biosample) %>%
summarise(across(everything(), ~ list(R = sum(. == "R"), S = sum(. == "S")))) %>%
mutate(phenotype = c("R", "S")) %>%
pivot_longer(-phenotype, names_to = "antibiotic", values_to = "value") %>%
as.data.frame() %>%
mutate(value = as.numeric(as.character(value))) %>%
mutate(antibiotic = case_when(antibiotic == "para-aminosalicylic_acid" ~ "PAS",
TRUE ~ antibiotic)) %>%
ggplot(., aes(x = antibiotic, y = value)) +
geom_bar(aes(fill = phenotype), position = "dodge",stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0, size = 8)) +
geom_text(
stat = "identity",
aes(label = value, group = phenotype),
vjust = -1,
position = position_dodge(width = 0.9),
size = 2.5) +
ylab("# samples") +
labs(title = "Data used for DRPRG: WHO samples + extra samples",
subtitle = paste0("n = ", n_samples, " samples"),
caption = "There is at least one drug phenotype for each sample")
The WHO data (minus the “enriched” part) was used to train the WHO catalogue. The “enriched” dataset minus the WHO data (e.g. the data from other publications that wasn’t included in the WHO training data) is as follows:
A breakdown of R/S phenotypes for the samples can be seen in the plot below:
not_WHO_data %>%
select(-run, -bioproject, -biosample) %>%
summarise(across(everything(), ~ list(R = sum(. == "R"), S = sum(. == "S")))) %>%
mutate(phenotype = c("R", "S")) %>%
pivot_longer(-phenotype, names_to = "antibiotic", values_to = "value") %>%
as.data.frame() %>%
mutate(value = as.numeric(as.character(value))) %>%
mutate(antibiotic = case_when(antibiotic == "para-aminosalicylic_acid" ~ "PAS",
TRUE ~ antibiotic)) %>%
ggplot(., aes(x = antibiotic, y = value)) +
geom_bar(aes(fill = phenotype), position = "dodge",stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0, size = 8)) +
geom_text(
stat = "identity",
aes(label = value, group = phenotype),
vjust = -1,
position = position_dodge(width = 0.9),
size = 2.5) +
ylab("# samples") +
labs(title = "Data used for DRPRG: extra samples only",
subtitle = paste0("n = ", n_samples_not_WHO, " samples"),
caption = "There is at least one drug phenotype for each sample")
How many CRyPTIC samples in the WHO training dataset?
cryptic = fread("/Users/kmalone/macbook_m1_backup/github/cryptic-catalogue/data/CRyPTIC_reuse_table_20231107.csv")
# mismatch_WHO_cryptic_run = cryptic %>%
# dplyr::rename("run" = ENA_RUN) %>%
# anti_join(., who, by = "run")
#
# mismatch_WHO_cryptic_run_dedup = mismatch_WHO_cryptic_run %>%
# separate(run, into = c("run1", "run2"), sep = "\\.", extra = "merge") %>%
# pivot_longer(cols = starts_with("run"), names_to = "run_type", values_to = "ENA_RUN") %>%
# filter(ENA_RUN != "") %>%
# select(-run_type)
#
# mismatch_WHO_cryptic_run_dedup %>%
# dplyr::rename("run" = ENA_RUN) %>%
# anti_join(., who, by = "run")
#
# mismatch_WHO_cryptic_ENA = cryptic %>%
# dplyr::rename("sample" = ENA_SAMPLE) %>%
# anti_join(., who, by = "sample")
mismatch_pheno_cryptic_run = cryptic %>%
separate(ENA_RUN, into = c("run1", "run2"), sep = "\\.", extra = "merge") %>%
pivot_longer(cols = starts_with("run"), names_to = "run_type", values_to = "run") %>%
filter(run != "") %>%
anti_join(., pheno_data, by = "run") %>%
dplyr::rename("biosample" = ENA_SAMPLE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 12163 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
mismatch_pheno_cryptic_ENA = cryptic %>%
dplyr::rename("biosample" = ENA_SAMPLE) %>%
anti_join(., pheno_data, by = "biosample") %>%
dplyr::rename("run" = ENA_RUN)
mismatch_pheno_cryptic_ENA %>%
separate(run, into = c("run1", "run2"), sep = "\\.", extra = "merge") %>%
pivot_longer(cols = starts_with("run"), names_to = "run_type", values_to = "run") %>%
filter(run != "") %>%
anti_join(., pheno_data, by = "run") %>%
select(run, biosample)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3392 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## # A tibble: 981 × 2
## run biosample
## <chr> <chr>
## 1 ERR4810795 ERS5298822
## 2 ERR4810696 ERS5298723
## 3 ERR4810731 ERS5298758
## 4 ERR4810730 ERS5298757
## 5 ERR4810729 ERS5298756
## 6 ERR4811033 ERS5300751
## 7 ERR4810668 ERS5298695
## 8 ERR4810689 ERS5298716
## 9 ERR4810721 ERS5298748
## 10 ERR4811028 ERS5300746
## # ℹ 971 more rows
#sites
mismatch_pheno_cryptic_ENA %>%
separate(run, into = c("run1", "run2"), sep = "\\.", extra = "merge") %>%
pivot_longer(cols = starts_with("run"), names_to = "run_type", values_to = "run") %>%
filter(run != "") %>%
anti_join(., pheno_data, by = "run") %>%
dplyr::mutate(x = str_split(UNIQUEID, "\\.subj") %>% map_chr(1)) %>%
with(table(.$x))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3392 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
##
## site.02 site.03 site.04 site.05 site.06 site.08 site.10 site.11 site.14 site.17
## 82 22 3 2 357 36 4 456 12 5
## site.20
## 2
According to the README.md in the data source, the phenotype data can
be found in config/illumina.samplesheet.csv.
It was synthesized by summarising the experiments found in
config/samplesheets/ using
workflow/notebook/notepad.ipynb. This notebook is large and
includes various analyses but the one we are interested in is “Data
Collation”.
The first steps involved creating a WHO base dataset with two different data sources.
“gentb”, which is labelled as WHO correspondence. This is the data underlying Maha Farhat’s GenTB paper.
“WHO”, which I’m assuming is from the latest catalogue.
Lots of data cleaning, resolving discrepancies between the two datasets and extra metadata added using code from https://github.com/mbhall88/WHO-correspondence/blob/main/docs/fill_in_who_samplesheet.py.
Next, data was sequentially added and cleaned data from various publications:
| dataset | reference | phenotype method |
|---|---|---|
| gentb | https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00953-4 | various |
| trisakil | https://doi.org/10.1080/22221751.2022.2099304 | various |
| smith | https://pubmed.ncbi.nlm.nih.gov/33055186/ | liquid MGIT 960 system (Bactec MGIT SIRE and PZA package inserts; Becton, Dickinson) and solid 7H10 agar proportion method |
| peker | https://doi.org/10.1099/mgen.0.000695 | incredibly not noted, only reference in methods is 'All MTB isolates were phenotypically tested for drug susceptibility (phenotypic DST)' |
| merker | https://doi.org/10.1038/s41467-022-32455-1 | various |
| finci | https://pubmed.ncbi.nlm.nih.gov/35907429/ | BACTEC MGIT 960 DST and Sensititre MYCOTB MIC plate (binary results reported) |
| leah_bdq | https://doi.org/10.1101/2022.12.08.519610 | BACTEC MGIT 960 DST |
| marco_pheno | https://doi.org/10.3389/fmicb.2023.1104456 | not stated, just that they were 'according to WHO classification' |
| lempens_acc | https://doi.org/10.1016/j.ijid.2020.08.042 | LJ slopes and 7H11 plates, proportional |
Then, the following was noted: “Get the BioProject of all BioSamples with antibiogram data in NCBI. Once I have the BioProject, I can …..download the antibiogram table”.
This resulted in an extra 1073 samples being added to the superset with phenotypes for at least one drug.
| bioproj | reference | phenotype method |
|---|---|---|
| PRJNA353873 | MGIT, MICs listed | |
| PRJNA413593 | MGIT, MICs listed | |
| PRJNA438921 | MGIT, MICs listed | |
| PRJNA557083 | MGIT, MICs listed | |
| PRJNA650381 | MGIT and proportional agar, MICs listed for both | |
| PRJNA663350 | MGIT, MICs listed | |
| PRJNA717333 | 96 well plate, MICs listed | |
| PRJNA824124 | MGIT, MICs listed | |
| PRJNA834625 | LJ slopes, MICs listed | |
| PRJNA888434 | MGIT, MICs listed |
All data together:
| dataset | reference | phenotype method |
|---|---|---|
| gentb | https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00953-4 | various |
| trisakil | https://doi.org/10.1080/22221751.2022.2099304 | various |
| smith | https://pubmed.ncbi.nlm.nih.gov/33055186/ | liquid MGIT 960 system (Bactec MGIT SIRE and PZA package inserts; Becton, Dickinson) and solid 7H10 agar proportion method |
| peker | https://doi.org/10.1099/mgen.0.000695 | incredibly not noted, only reference in methods is 'All MTB isolates were phenotypically tested for drug susceptibility (phenotypic DST)' |
| merker | https://doi.org/10.1038/s41467-022-32455-1 | various |
| finci | https://pubmed.ncbi.nlm.nih.gov/35907429/ | BACTEC MGIT 960 DST and Sensititre MYCOTB MIC plate (binary results reported) |
| leah_bdq | https://doi.org/10.1101/2022.12.08.519610 | BACTEC MGIT 960 DST |
| marco_pheno | https://doi.org/10.3389/fmicb.2023.1104456 | not stated, just that they were 'according to WHO classification' |
| lempens_acc | https://doi.org/10.1016/j.ijid.2020.08.042 | LJ slopes and 7H11 plates, proportional |
| PRJNA353873 | MGIT, MICs listed | |
| PRJNA413593 | MGIT, MICs listed | |
| PRJNA438921 | MGIT, MICs listed | |
| PRJNA557083 | MGIT, MICs listed | |
| PRJNA650381 | MGIT and proportional agar, MICs listed for both | |
| PRJNA663350 | MGIT, MICs listed | |
| PRJNA717333 | 96 well plate, MICs listed | |
| PRJNA824124 | MGIT, MICs listed | |
| PRJNA834625 | LJ slopes, MICs listed | |
| PRJNA888434 | MGIT, MICs listed |